Take-home Exercise 1: Geospatial Analytics for Social Good

Author

Amelia Chua

Published

Nov 2022

1 Overview

Water is an essential resource that not only supports life but also drives economic development. According to the World Bank, approximately 2 billion people in the world do not have safely managed drinking water services and 3.6 billion people lack safely managed sanitation services1. Developing countries are most affected by the shortage of water. The lack of ground water threatens their fight against poverty, food and water security and socio-economic development2.

1.1 About Water Point Data Exchange

Organisations like the World Bank, UNICEF and Water Point Data Exchange (WPdx) have various plans and schemes in place to combat this issue. In particular, WPdx has it in their mission to unlock potential of water point data to improve rural water services through evidence-based decision-making. It maintains a global data repository for data collected from rural areas at the water point or small water scheme level. A notable point is that data is formatted according to the WPdx Data Standard before being uploaded and published onto the repository. Using these information, decision support tools linked to the repository will be able to perform advanced analyses seamlessly3.

1.2 Objectives

Geospatial analytics hold tremendous potential to address complex societal problems like water shortage. In this study, I will apply appropriate global and local measures of spatial association techniques to reveal the spatial patterns of functional and non-functional water points. This will include will not limited to the following tasks:

  • Using appropriate sf method, import the shape file into R and save it in a simple feature data frame format. Note that there are three Projected Coordinate Systems of Nigeria, they are: EPSG: 26391, 26392, and 26303. You can use any one of them.

  • Using appropriate tidyr and dplyr methods, derive the proportion of functional and non-functional water point at LGA level.

  • Combining the geospatial and aspatial data frame into simple feature data frame.

  • Performing outliers/clusters analysis by using appropriate local measures of spatial association methods.

  • Performing hot spot area analysis by using appropriate local measures of spatial association methods.

1.3 The Study Area

The focus of this study would be Nigeria. Nigeria is located in West Africa and is the most populous country in Africa. The states are grouped into six geopolitical zones, the North Central, North East, North West, South West, South East and South4. UNICEF estimates that one third of Nigeria children do not have sufficient water to meet daily needs5.

2 Getting Started

2.1 Setting the Analytical Tools

The code chunk below installs and loads sf, spdep, tmap, tidyverse, patchwork packages into R environment. pacman() is a R package management tool.

Show the code
pacman::p_load(sf, spdep, tmap, tidyverse, patchwork)

3 Data Preparation

3.1 Data

As mentioned in the earlier section, the focus of this study is Nigeria. Two data sets will be used in this study. They are:

  • Nigeria Level-2 Administrative Boundary (also known as Local Government Area or LGA) polygon feature GIS data. The data was obtained from geoBoundaries.

  • WPdx+ data set that was obtained from Water Point Data Exchange (WPdx). It consists of water point related data from rural areas at the water point or small water scheme level. The entire set of data includes countries other than Nigeria. Hence, we will be performing data pre-processing to extract the relevant data.

    Tip

    The raw WPdx+ data file is 427mb and exceeds the upload limit of Github. In the next section, we will extract the relevant and necessary information, extract it into a .rds file and use the file for subsequent analysis. The raw file will not be pushed to Github to avoid crashing the Github repository.

3.2 Importing the data into R Environment

The geospatial data is in ESRI shape file format and the attribute table is in csv format.

3.2.1 Importing Geospatial data into R

The code chunk below uses st_read() function of sf package to import geoBoundaries-NGA-ADM2 shape file into R as a polygon feature data frame. The imported shape file will be a simple features object of sf.

Show the code
nigeria <- st_read(dsn = "data\\geospatial",
                   layer = "geoBoundaries-NGA-ADM2")
Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `C:\ameliachuayt\ISSS624\Exercises\Take-home_Ex1\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

From the output, we can see that there are 774 multipolygons features with 5 fields. nigeria is in WGS 84 coordinates system. The bounding box provides the x extend and y extend of the data.

To learn more about the attribute information, we can apply glimpse() of dplyr package.

Show the code
glimpse(nigeria)
Rows: 774
Columns: 6
$ shapeName  <chr> "Aba North", "Aba South", "Abadam", "Abaji", "Abak", "Abaka…
$ Level      <chr> "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "AD…
$ shapeID    <chr> "NGA-ADM2-72505758B79815894", "NGA-ADM2-72505758B67905963",…
$ shapeGroup <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NG…
$ shapeType  <chr> "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "AD…
$ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((7.401109 5...., MULTIPOLYGON (…

The printout above details the data type of each field. For instance, $ shapeName is in character data type.

3.2.2 Importing attribute data into R

The WPdx+ data set has 70 columns and 406,566 rows.

Show the code
wpdx <- read_csv("data\\aspatial\\Water_Point_Data_Exchange.csv", show_col_types = FALSE)

At any point in time when we wish to see the columns of a dataframe, we can use glimpse(). glimpse() allows us to see all columns and their data type in the data frame which is very helpful since we have 70 columns.

Show the code
glimpse(wpdx)

3.3 Data Wrangling - Geospatial Data

3.3.1 Rename Columns

For ease of referencing, let’s rename shapeName to LGA.

Show the code
nigeria <- nigeria %>% 
    rename(`LGA` = `shapeName`)

3.3.2 Check for Duplicates

Using duplicated() from R base functions, we can seek out LGA names that might be duplicated. This step is important for later part of the analysis.

Show the code
nigeria$LGA[duplicated(nigeria$LGA)==TRUE]

Based on the above, we have 6 LGAs that have the same name. A desk-based research using the coordinates showed that there are two scenarios that led to this duplication:

  1. Identically named LGAs are located in different states. For instances, there is a Bassa in Kogi State and Plateau State.
  2. There is a misspelling. In the case of Nasawara, one of it should be spelled as Nassarawa.

Let’s correct these errors:

Show the code
nigeria$LGA[94] <- "Bassa_Kogi"
nigeria$LGA[95] <- "Bassa_Plateau"
nigeria$LGA[304] <- "Ifelodun, Kwara"
nigeria$LGA[305] <- "Ifelodun, Osun"
nigeria$LGA[355] <- "Irepodun, Kwara"
nigeria$LGA[356] <- "Irepodun, Osun"
nigeria$LGA[519] <- "Nassarawa, Kano"
nigeria$LGA[546] <- "Obi, Benue"
nigeria$LGA[547] <- "Obi, Nasarawa"
nigeria$LGA[693] <- "Surulere, Lagos"
nigeria$LGA[694] <- "Surulere, Oyo"

Next, let’s double check that the corrections are made.

Show the code
nigeria$LGA[duplicated(nigeria$LGA)==TRUE]

3.4 Data Wrangling - Aspatial Data

The entire data set is large, and we need only to extract the relevant information required for the analysis. The focus of the study is Nigeria and the analysis will be done at the Level-2 Administrative Boundary (or LGA) level. Therefore, we will be performing steps to:

  1. Extract data belonging to Nigeria and
  2. Group water points according to their functional status at the LGA level.

Besides the above, we will also be performing data preparation and wrangling techniques to surface data issues and resolve them prior to the analysis. Before we start our data wrangling, it would be useful to inspect the metadata to understand what each column represents.

3.4.1 Extract data belonging to Nigeria

To learn which column(s) to use to filter for Nigeria’s data, we can inspect the metadata. We will not display the entire metadata here as it is lengthy. However, here is an excerpt of some columns:

Column Name Description
#clean_country_name Cleaned version of the country name based on provided GPS coordinates.
#clean_adm1 Cleaned version of the Primary Administrative Division data based on provided GPS coordinates and GADM boundaries.
#clean_adm2 Cleaned version of the Secondary Administrative Division data based on provided GPS coordinates and GADM boundaries.
#status_id Identify if any water is available on the day of the visit, recognizing that it may be a limited flow.
#status_clean Categorized version of the #status parameter. Based on terms from the #status entry, status_clean includes 5 categories: Fully functional, Functional but needs repair, Non functional and needs repair, Non functional due to dry season, Abandoned and Other. These categories will continue to evolve and will be refined in future updates.
#status Status of the physical/mechanical condition of the water point.

Based on the above, we can use #clean_country_name to filter out rows belonging to Nigeria. This can be done using the code chunk below. The filtered data set will be saved as wpdx_nigeria. We can also inspect the first few rows of the data by using head().

Show the code
wpdx_nigeria <- wpdx %>%
  filter(`#clean_country_name` == "Nigeria")

head(wpdx_nigeria)

Let’s use dim() to reveal the dimensions of the wpdx_nigeria.

Show the code
dim(wpdx_nigeria)

The output of the above code would reveal that wpdx_nigeria has 95,008 rows and 70 columns.

3.4.2 Resolving Misspellings

#status_clean provides us the status of the water points. Using the code chunk below, we use count() dplyr package to count the frequency of the each location and/or category.

Show the code
count(wpdx_nigeria, `#status_clean`)
Output of count(wpdx_nigeria, `#status_clean`)
#status_clean n
Abandoned 175
Abandoned/Decommissioned 234
Functional 45,883
Functional but needs repair 4,579
Functional but not in use 1,686
Non-Functional 29,385
Non-Functional due to dry season 2,403
Non Functional due to dry season 7
NA 10,656

From the output above, we observe two issues for the #status_clean column: misspellings, missing data and too many categories. Let’s tackle misspellings first.

We can easily see that there two similar categories “Non functional due to dry season” = “Non-Functional due to dry season”. One is spelled with a dash and one without. Let’s correct this by using the recode() from dplyr package.

Show the code
#recode 
wpdx_nigeria_clean <- wpdx_nigeria %>%
  mutate(`#status_clean` = recode(`#status_clean`, "Non functional due to dry season" = "Non-Functional due to dry season"))

We can confirm that the categories has been recoded correctly by running count() again.

Show the code
#re-run the frequency count
count(wpdx_nigeria_clean, `#status_clean`)

3.4.3 Missing values

We also note that there are 10,656 missing values in the #status_clean column. Let’s rename all the NA as ‘Unknown’.

Show the code
#recode
wpdx_nigeria_clean <- wpdx_nigeria_clean %>%
    mutate(`#status_clean` = replace_na(`#status_clean`, "Unknown"))
Show the code
count(wpdx_nigeria_clean, `#status_clean`)

3.4.4 Aggregate to reduce categories

In our study, we would like know the functional and non-functional water points. Therefore, we can actually aggregate our data into three categories: functional, non-functional and unknown as follows:

Old Category New Category
Abandoned Non-Functional
Abandoned/Decommissioned Non-Functional
Functional Functional (no change)
Functional but needs repair Functional
Functional but not in use Functional
Non-Functional Non-Functional (no change)
Non-Functional due to dry season Non-Functional
Unknown Unknown (no change)

We will create a new column that states whether the water point if functional or not using the code chunk below.

Show the code
#recode
wpdx_nigeria_clean <- wpdx_nigeria_clean %>%
  mutate(`Functional_Status` = `#status_clean`) %>%
  mutate(`Functional_Status` = recode(`Functional_Status`,
        "Abandoned" = "Non-Functional",
        "Abandoned/Decommissioned" = "Non-Functional",
        "Functional but needs repair" = "Functional",
        "Functional but not in use" = "Functional",
        "Non-Functional due to dry season" = "Non-Functional"))

Again, we can re-run the frequency count to confirm the recode has been performed.

Show the code
#re-run the frequency count
count(wpdx_nigeria_clean, `Functional_Status`)

3.4.5 Rename Columns

Let us rename #clean_adm2 to LGA.

Show the code
wpdx_nigeria_clean <- wpdx_nigeria_clean %>% 
    rename(`LGA` = `#clean_adm2`)

3.4.6 Drop Unwanted Columns

Show the code
wpdx_nigeria_simple <- subset(wpdx_nigeria_clean , 
                              select = c("LGA", "#lat_deg", 
                                        "#lon_deg","Functional_Status"))

#"#water_source_clean",
# "#water_source_category", "#distance_to_primary_road",
# "#distance_to_secondary_road", "#distance_to_tertiary_road",
# "#distance_to_city", "#distance_to_town", "usage_capacity",
# "is_urban", "cluster_size", 

3.4.7 Creating a simple feature data frame

Next, we will create a simple feature data frame from wpdx_nigeria_simple. This is done using the code chunk below.

Show the code
wpdx_nigeria_sf <- st_as_sf(wpdx_nigeria_simple, 
                    coords = c("#lon_deg","#lat_deg"), 
                    crs=4326) 

We can examine the content of this newly created simple feature data frame using the following code chunk.

Show the code
glimpse(wpdx_nigeria_sf)

From the output, we see that a new column called geometry has been added and the original #lon_deg and #lat_deg columns have been removed.

We can also use st_geometry() to retrieve the geometry list-column as shown in the code chunk below.

Show the code
st_geometry(wpdx_nigeria_sf)

The output displays basic information of the feature class such as type of geometry, the geographic extent of the features and the coordinate system of the data. What we can see is that wpdx_nigeria_sf is in the WGS 84 coordinate system.

3.4.8 Point in Polygon Count

Show the code
functional <- wpdx_nigeria_sf %>%
  filter(`Functional_Status` == 'Functional')

non_functional <- wpdx_nigeria_sf %>%
  filter(`Functional_Status` == 'Non-Functional')

unknown <- wpdx_nigeria_sf %>%
  filter(`Functional_Status` == 'Unknown')

Next, we can count the number of water points in each LGA using the following code chunk. Two operations are happening at the same time. First, the code chunk identifies water points located inside each LGA by using st_intersects(). Next, lengths() of Base R is used to calculate the number of water points that fall inside each LGA.

Show the code
nigeria$wpt_functional <- lengths(st_intersects(nigeria, functional))
nigeria$wpt_nonfunctional <- lengths(st_intersects(nigeria, non_functional))
nigeria$wpt_unknown <-lengths(st_intersects(nigeria, unknown))
nigeria$wpt_total <- lengths(st_intersects(nigeria, functional)) + 
  lengths(st_intersects(nigeria, non_functional)) + 
  lengths(st_intersects(nigeria, unknown))
Show the code
sum(nigeria$wpt_total)

3.4.9 Derive new features

Show the code
nigeria <- nigeria %>%
  mutate(pct_functional = `wpt_functional`/ `wpt_total`) %>%
  mutate(`pct_functional` = replace_na(`pct_functional`, 0)) %>%
  mutate(pct_nonfunctional = `wpt_nonfunctional`/ `wpt_total`) %>%
  mutate(`pct_nonfunctional` = replace_na(`pct_nonfunctional`, 0))

3.4.10 Transforming Coordinate System

We will transform nigeria from geographic coordinate system to projected coordinate system. We need to do this transformation because the geographic coordinate system is inappropriate if the analysis require the use of distance and/or area measurements. This would be at the later stage where we compute distance-based contiguity weight matrices.

There are three Projected Coordinate Systems of Nigeria: EPSG: 26391, 26392, and 26303. For this study, we will be EPSG 26391. We can use the st_transform() of the sf package to re-project nigeria from one coordinate system to another coordinate system mathematically.

We will save a copy of the nigeria in the geographic coordinate system before the projection.

Show the code
nigeria_gcs <- nigeria 
nigeria <- st_transform(nigeria, crs = 26391)

Next, let us display the content of nigeria sf data frame as shown below.

Show the code
st_geometry(nigeria)

Notice that it is in projected coordinate system now. Furthermore, if you refer to Bounding box:, the values are greater than 0-360 range of decimal degree commonly used by most of the geographic coordinate systems.

3.4.11 Saving the Analytical Data Table

Next, let’s save our cleaned data into .rds data format files using the write_rds() of readr package. The output file is called wp_nga.rds and it is saved in rds sub-folder. We do this to shorten the loading time and more importantly, we can avoid uploading the large raw files onto GitHub.

Show the code
write_rds(nigeria, "data\\rds\\wp_nga.rds")

Note: WPdx+ also offers the data in shape file format. We can also download that version for our analysis. This is covered here, if you wish to learn more.

4 Exploratory Data Analysis

Let’s load the .rds file using the readRDS() function.

Show the code
nigeria <- readRDS("data\\rds\\wp_nga.rds")

To avoid skewing the subsequent analysis, we will exclude LDAs without any water points (functional, non-functional and unknown).

Show the code
nigeria <- nigeria %>%
  filter(nigeria$wpt_total != 0)

There are 13 LGAs without any water points. The number of LGAs decreased from 774 to 761.

4.1 Water Point Density using Histograms

While our interest is in geographical distribution of functional and non-functional water points, it would be interesting to see the water point density too. To do this, we must first compute the area of LGA and then the water point density.

The code chunk below uses st_area() of sf package to derive the area of each LGA. We are creating a new column Area to store the area values.

Show the code
nigeria$Area <- nigeria %>%
  st_area()
Show the code
nigeria <- nigeria %>%
  mutate(`wpt_func_density` = (`wpt_functional` / Area * 1000000))
Show the code
ggplot(data=nigeria,
       aes(x=as.numeric(`wpt_func_density`))) +
  geom_histogram(bins=20,
                 color="black",
                 fill = "pink") + 
  labs(title = "Are functional water points evenly distributed in Nigeria?",
       subtitle = "While there are many LGAs less than one waterpoint per km sq, there is 1 LGA with 10 \nfunctional water points per km sq.",
       x = "Functional Water Point density (per km sq)",
       y = "Frequency")

We can repeat the same for non-functional water points.

Show the code
nigeria <- nigeria %>%
  mutate(`wpt_nonfunc_density` = (`wpt_nonfunctional` / Area * 1000000))

ggplot(data=nigeria,
       aes(x=as.numeric(`wpt_nonfunc_density`))) +
  geom_histogram(bins=20,
                 color="black",
                 fill = "orange") + 
  labs(title = "Are non-functional water points evenly distributed in Nigeria?",
       subtitle = "It would be good to know that most areas have less than one non-functional water point \nper km sq.",
       x = "Non-Functional Water Point density (per km sq)",
       y = "Frequency")

4.2 Distribution of water points in Nigeria

There are 94,979 water points in Nigeria. This can be calculated in the code chunk below.

Show the code
sum(nigeria$wpt_total)
[1] 94979

Let’s find out the distribution of the various functional status of water points in Nigeria and use ggplot2 to visualise.

Show the code
nigeria %>%
  pivot_longer(c(wpt_functional, wpt_nonfunctional,wpt_unknown),
               names_to = "status", values_to = "count") %>%
  count(status, wt = count) %>%
  ggplot(aes(status)) + geom_bar(aes(weight = n))

As we can see in the figure above, more than half of the water points are functional, slightly more than a third are non-functional with the remaining unknown.

In the code chuck below, we will use tmap to plot the spatial distribution of water points of different categories. We will use tmap_arrange() to show the plots together.

Before this, we will create a helper function that will help us to plot the choropleths with ease.

Show the code
# input: the dataframe and the variable name, chart style, title 
choropleth_plot <- function(df, varname, style, title) {
  tm_shape(df) +
    tm_fill(varname, 
          n= 5,
          style = style) +
    tm_borders(alpha = 0.5) +
    tm_layout(main.title = title,
              main.title.size = 1,
              main.title.position = "center",
              legend.height = 0.45, 
              legend.width = 0.35,
              frame = TRUE)+ 
    tm_compass(position = c('left','bottom'))
}

Here, we will plot the distributions of all, unknown, functional and non-functional water points.

Show the code
tmap_mode("plot")
tmap_arrange(
choropleth_plot(nigeria, "wpt_total", "pretty", "All water points in Nigeria: \npartitioned by 'pretty' intervals"),
choropleth_plot(nigeria, "wpt_unknown", "pretty", "Status unknown water points in Nigeria: \n partitioned by 'pretty' intervals"),
choropleth_plot(nigeria, "wpt_functional", "pretty", "Functional water points in Nigeria: \n partitioned by 'pretty' intervals"),
choropleth_plot(nigeria, "wpt_nonfunctional", "pretty", "Non-functional water points in Nigeria: \n partitioned by 'pretty' intervals"))

The choropleths above are partitioned using the default or ‘pretty’ intervals. We can observe that the number of water points (total) are not evenly spread across evenly–we see mostly lighter shades of orange and certain areas with darker orange. Water points with status unknown are more common in the southern part of the map as compared to the northern parts of Nigeria. In terms of functional water points, we can see areas in the top middle of the map are in darker shades of orange which indicates a higher number of functional water points compared to other areas. For non-functional water points, we see that the outer regions of the maps have lesser non-functional water points. Moving towards the center of the map, we see more non-functional water points. Note that this may be because of the total of water points in the area. For instance, if there is a small number (e.g., <10) of water points in an area and the worse case scenario is that all are non-functional. Compared to an area with >50 water points, having 10 non-functioning water points is only 20% of its supply source.

Let’s try to plot the same charts, this time, using the quantile partitioning method.

Show the code
tmap_mode("plot")
tmap_arrange(
choropleth_plot(nigeria, "wpt_total", "quantile", "All water points in Nigeria: \npartitioned by 'quantile' intervals"),
choropleth_plot(nigeria, "wpt_unknown", "quantile", "Status unknown water points in Nigeria: \n partitioned by 'quantile' intervals"),
choropleth_plot(nigeria, "wpt_functional", "quantile", "Functional water points in Nigeria: \n partitioned by 'quantile' intervals"),
choropleth_plot(nigeria, "wpt_nonfunctional", "quantile", "Non-functional water points in Nigeria: \n partitioned by 'quantile' intervals"))

The quantile partitioning method creates intervals with an equal number of features i.e., polygons. We can make different observations using this map. When looking at total water points, we can see that north eastern areas and south western areas have relatively less water points as compared to the other areas. This pattern is also seen in the functional water points and non-functional water points choropleth maps. We can see that the northern parts of Nigeria have relatively higher number of functional water points, and relatively lesser non-functional water points as compared to other areas. As we can see in the above, plotting the graphs in “quantiles” gives us a better sense of relative levels of water points as compared to the “pretty” method.

We can also plot our choropleths using its proportions.

Show the code
#Repeat plot with the prportions
tmap_mode("plot")
tmap_arrange(
choropleth_plot(nigeria, "pct_functional", "pretty", "Proportion of Functional water points \nin Nigeria: partitioned by 'pretty' intervals"),
choropleth_plot(nigeria, "pct_nonfunctional", "pretty", "Proportion of Non-functional water points \nin Nigeria: partitioned by 'pretty' intervals"))

The above two maps are plotted in the same scale (refer to the legend) which makes it easy for comparison between the two. The top 1/3 of the maps seem to have a higher proportion of functional water points than non-functional. The bottom 1/3 of the maps seem to have a slightly higher proportional of non-functional water points than functional ones.

We can also see that there is somewhat an inverse relationship when the proportion is >0.60. This means that the darker areas in one map would likely show up as light areas in the other. For instance, the top of the functional water points maps is in darker shades of orange while the same areas are in lighter shades in the non-functional water points map. We say that it is ‘likely’ the case because there are areas with status unknown water points as well.

5 Exploratory Spatial Data Analysis

Exploratory Spatial Data Analysis or ESDA is an extension of Exploratory Data Analysis. It consists of descriptive techniques to discover spatial distribution of data and identify outliers6.

In this section, we will cover global and local spatial autocorrelation. The former focuses on the overall trend while the latter enables us to identify hot and cold spots in the data.

5.1 Global Spatial Autocorrelation

According to Tobler’s First Law of Geography, “everything is related to everything else, but near things are more related than distant things.”

This sub-section will cover the computation of global spatial autocorrelation statistics and spatial complete randomness test for global spatial autocorrelation. The goal of these analyses is to understand whether water points are evenly distributed across Nigeria.

5.1.1 Spatial Weights Matrix

To compute global spatial autocorrelation, we first need to construct spatial weights of the Nigeria. Spatial relationships are multi-directional and multi-lateral (Ref: Compulsory Reading). We can use Spatial Weights to define spatial neighbourhood for subsequent spatial analysis. There are two commonly used methods of spatial weights: contiguity-based and distanced-based.

In contiguity-based, neighbours share a common boundary which is considered differently in different methods. In Rook contiguity, neighbours have a common edge. In Queen contiguity, neighbours need to only share a common vertex or edge. So in comparison, Queen contiguity is more encompassing as compared to Rook contiguity7.

The differences between the two is illustrated in the picture below.

Contiguity Matrix

In distance-based contiguity, we have fixed weighting and adaptive weighting schemes. The former considers two regions are neighbours if they are within a specified distance from one another. In the latter scheme, each region will have the same number of neighbours. The number of neighbour is specified beforehand. If k = 8 neighbours, it classifies the nearest 8 regions as neighbours.

Which method of spatial weights method to use depends on the geographical location we are working with. If the geographical location consists of many isolated islands, then contiguity-based matrix may yield many regions with no neighbours. If the sizes of the features (polygons) are wide ranging where you have very large features and relatively smaller features, then contiguity-based matrix may also result in larger features having many more neighbours which may skew the results–there would be a smoothing effect to the larger number of neighbours.

5.1.2 Contiguity-based Spatial Weights

In the code chunk below, poly2nb() of spdep package is used to compute contiguity weight matrices for the study area. This function builds a neighbours list based on regions with contiguous boundaries. By default, Queen contiguity is applied.

5.1.2.1 Contiguity-based (Queen) Spatial Weight Contiguity

Show the code
wm_q <- poly2nb(nigeria, 
                queen = TRUE)
summary(wm_q)
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 4348 
Percentage nonzero weights: 0.750793 
Average number of links: 5.713535 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  4  16  57 122 177 137 121  71  39  11   4   1   1 
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links

The summary report above shows that there are 761 area units in Nigeria. The most connected region has 14 links. There are four least connected regions with only one neighbour.

5.1.2.2 Contiguity-based (Rook) Spatial Weight Contiguity

In the code chunk below, poly2nb() of spdep package is used to compute contiguity weight matrices for the study area. We specify queen = FALSE to compute Rook contiguity.

Show the code
wm_r <- poly2nb(nigeria, 
                queen = FALSE)
summary(wm_r)
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 4328 
Percentage nonzero weights: 0.7473395 
Average number of links: 5.687254 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  14 
  4  16  59 124 176 138 123  65  40  10   4   1   1 
4 least connected regions:
136 497 513 547 with 1 link
1 most connected region:
496 with 14 links

The summary report above shows that there are 751 area units in Nigeria. The most connected region has 14 links. There are four least connected regions with only one neighbour.

A summary of the results are below. In our case, both methods yield similar results.

Queen Contiguity Rook Contiguity
No. of regions with no links 0 0
Average number of links 5.71 5.69

5.1.3 Distance-based Contiguity Weight Matrix

We will derive the distance-based weight matrices by using dnearneigh() of spdep package. The function identifies neighbours of region points by Euclidean distance with a distance band with lower and upper bounds controlled by the bounds argument or by Great Circle distance in kilometres if longlat argument is set to TRUE

5.1.3.1 Fixed Distance Weight Matrix

Determining the cut-off distance

To ensure that each region has at least one neighbour, we need to find out the minimum distance within which all regions have at least one neighbour. We can do this by following these steps:

  • Getting the coordinates of polygon centroids. This is required as an input in the next step.

    We need to associate each polygon with a point and its coordinates need to be in a separate data frame. We will use a mapping function that applies a given function to each element of a vector and returns a vector of the same length. Our input vector will be the geometry column of nigeria. Our function will be st_centroid(). We will be using map_dbl() variation of map from the purrr package. purrr is loaded when we load tidyverse package.

    To get our longitude values we map the st_centroid() function over the geometry column of nigeria and access the longitude value through double bracket notation [[]] and 1. This allows us to get only the longitude, which is the first value in each centroid.

    Show the code
    longitude <- map_dbl(nigeria$geometry, ~st_centroid(.x)[[1]])

    We do the same for latitude with one key difference. We access the second value per each centroid with [[2]]

    Show the code
    latitude <- map_dbl(nigeria$geometry, ~st_centroid(.x)[[2]])

    Now that we have latitude and longitude, I used cbind to put longitude and latitude into the same object. We should check the first few observations to see if things are formatted correctly.

    Show the code
    coords <- cbind(longitude, latitude)
    head(coords, 5)
         longitude latitude
    [1,]  549364.0 123694.9
    [2,]  547123.4 120376.5
    [3,]  489057.4 534262.6
    [4,]  593718.2 113824.1
    [5,]  642618.7 251222.3
  • Return a matrix with the indices of points belonging to the set of the k nearest neighbours of each other by using knearneigh() of spdep.

  • Convert the knn object returned by knearneigh() into a neighbours list of class nb with a list of integer vectors containing neighbour region number ids by using knn2nb().

  • Return the length of neighbour relationship edges by using nbdists() of spdep. The function returns in the units of the coordinates if the coordinates are projected, in km otherwise.

  • Remove the list structure of the returned object by using unlist().

Show the code
k1 <- knn2nb(knearneigh(coords, k = 1))
k1dists <- unlist(nbdists(k1, coords))
summary(k1dists)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2669   12808   20008   21804   27013   72139 

The summary report shows that the largest first nearest neighbour distance is 72,139 metres, so using a number slightly larger than this (i.e. 72,200) as the upper threshold gives certainty that all regions will have at least one neighbour.

Using the code chunk below, we discover that the region with the maximum distance to its nearest neighbour is Sardauna.

Show the code
nigeria$LGA[match(max(k1dists), k1dists)]
[1] "Sardauna"

Computing the fixed distance weight matrix

Now, we will compute the distance weight matrix by using dnearneigh() as shown below.

Show the code
wm_d72 <- dnearneigh(coords,0,72200)
wm_d72
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 18050 
Percentage nonzero weights: 3.116793 
Average number of links: 23.71879 

From the output, we see that the average number of links is 23.7. The number is quite high and may skew the analysis.

Next, we will use str() to display the content of wm_d72 weight matrix.

Show the code
str(wm_d72)
List of 761
 $ : int [1:63] 2 4 9 24 54 65 67 101 120 179 ...
 $ : int [1:62] 1 4 9 24 54 65 67 101 120 179 ...
 $ : int [1:10] 11 19 252 257 438 445 457 628 677 682
 $ : int [1:56] 1 2 54 65 102 134 135 167 182 200 ...
 $ : int [1:21] 8 13 17 18 55 168 215 216 324 331 ...
 $ : int [1:19] 7 14 21 174 175 212 275 276 277 289 ...
 $ : int [1:32] 6 14 21 48 174 175 212 269 270 271 ...
 $ : int [1:26] 5 17 18 55 65 76 101 102 215 216 ...
 $ : int [1:64] 1 2 22 24 65 101 179 188 189 201 ...
 $ : int [1:22] 25 26 42 67 124 155 188 189 202 330 ...
 $ : int [1:11] 3 133 252 257 394 409 421 438 445 677 ...
 $ : int [1:13] 30 36 37 39 92 209 314 387 428 462 ...
 $ : int [1:24] 5 168 191 192 193 215 303 304 305 356 ...
 $ : int [1:27] 6 7 21 31 48 50 61 81 174 175 ...
 $ : int [1:37] 29 37 38 40 43 44 69 70 118 122 ...
 $ : int [1:34] 27 28 34 71 170 171 176 177 180 269 ...
 $ : int [1:30] 5 8 18 55 65 76 101 102 215 216 ...
 $ : int [1:42] 5 8 17 24 55 65 76 101 102 179 ...
 $ : int [1:7] 3 104 237 257 411 445 457
 $ : int [1:9] 59 60 160 263 474 508 565 583 613
 $ : int [1:31] 6 7 14 31 48 50 61 81 174 175 ...
 $ : int [1:64] 9 24 51 52 53 55 57 76 77 78 ...
 $ : int [1:5] 121 466 514 660 748
 $ : int [1:68] 1 2 9 18 22 53 55 65 76 101 ...
 $ : int [1:30] 10 26 42 67 155 188 189 202 330 364 ...
 $ : int [1:24] 10 25 42 67 155 189 202 330 364 365 ...
 $ : int [1:43] 16 28 34 69 70 122 170 171 176 177 ...
 $ : int [1:45] 16 27 34 69 70 122 170 171 176 177 ...
 $ : int [1:30] 15 37 38 39 40 43 44 173 183 184 ...
 $ : int [1:13] 12 36 92 156 208 209 210 283 302 548 ...
 $ : int [1:28] 14 21 48 50 61 81 175 194 205 212 ...
 $ : int [1:29] 46 109 128 140 143 153 164 217 225 231 ...
 $ : int [1:10] 41 102 134 135 211 369 540 546 720 744
 $ : int [1:33] 16 27 28 71 170 171 176 177 180 269 ...
 $ : int [1:8] 49 105 244 400 424 446 668 746
 $ : int [1:21] 12 30 37 38 39 40 184 190 195 196 ...
 $ : int [1:25] 12 15 29 36 38 39 40 43 184 190 ...
 $ : int [1:27] 15 29 36 37 39 40 43 183 184 190 ...
 $ : int [1:21] 12 29 36 37 38 40 43 184 190 209 ...
 $ : int [1:22] 15 29 36 37 38 39 43 44 184 190 ...
 $ : int [1:19] 33 134 135 182 200 279 280 369 488 525 ...
 $ : int [1:19] 10 25 26 67 120 124 155 188 189 243 ...
 $ : int [1:27] 15 29 37 38 39 40 44 69 173 184 ...
 $ : int [1:27] 15 29 40 43 69 173 185 186 190 284 ...
 $ : int [1:12] 117 374 381 409 415 421 430 450 509 643 ...
 $ : int [1:24] 32 109 125 128 153 164 225 232 236 239 ...
 $ : int [1:12] 63 64 73 111 129 259 380 399 420 472 ...
 $ : int [1:30] 7 14 21 31 50 61 81 174 175 205 ...
 $ : int [1:4] 35 105 401 424
 $ : int [1:27] 14 21 31 48 61 81 175 205 212 278 ...
 $ : int [1:47] 22 52 53 56 57 76 77 78 79 163 ...
 $ : int [1:37] 22 51 53 56 57 77 78 79 163 187 ...
 $ : int [1:58] 22 24 51 52 55 56 57 76 77 78 ...
 $ : int [1:33] 1 2 4 67 120 155 167 182 188 206 ...
 $ : int [1:51] 5 8 17 18 22 24 53 65 76 77 ...
 $ : int [1:35] 51 52 53 57 77 78 79 163 187 195 ...
 $ : int [1:37] 22 51 52 53 56 77 78 163 187 195 ...
 $ : int [1:5] 126 127 482 687 735
 $ : int [1:14] 20 60 156 263 304 305 548 550 565 576 ...
 $ : int [1:11] 20 59 160 262 263 474 565 576 579 583 ...
 $ : int [1:28] 14 21 31 48 50 81 175 194 205 212 ...
 $ : int [1:5] 378 408 458 752 759
 $ : int [1:7] 47 64 73 111 129 259 399
 $ : int [1:11] 47 63 73 107 111 259 380 399 670 688 ...
 $ : int [1:47] 1 2 4 8 9 17 18 24 55 101 ...
 $ : int [1:26] 71 118 122 177 180 298 299 333 340 341 ...
 $ : int [1:30] 1 2 10 25 26 42 54 120 155 188 ...
 $ : int [1:7] 138 144 245 268 489 500 501
 $ : int [1:44] 15 27 28 43 44 70 118 122 170 171 ...
 $ : int [1:50] 15 27 28 69 118 122 170 171 173 176 ...
 $ : int [1:20] 16 34 66 180 342 355 368 372 397 553 ...
 $ : int [1:6] 355 368 371 397 652 653
 $ : int [1:14] 47 63 64 107 111 114 247 259 659 670 ...
 $ : int [1:15] 108 227 250 253 266 367 376 392 414 425 ...
 $ : int [1:9] 249 281 419 450 461 534 634 664 738
 $ : int [1:56] 8 17 18 22 24 51 53 55 77 78 ...
 $ : int [1:51] 22 51 52 53 55 56 57 76 78 79 ...
 $ : int [1:57] 22 51 52 53 55 56 57 76 77 79 ...
 $ : int [1:39] 22 51 52 53 56 76 77 78 163 187 ...
 $ : int [1:19] 97 143 225 231 239 250 264 418 440 473 ...
 $ : int [1:21] 14 21 31 48 50 61 175 205 212 291 ...
 $ : int [1:6] 130 253 377 406 516 754
 $ : int [1:3] 146 429 679
 $ : int [1:38] 99 103 128 140 143 153 154 217 233 239 ...
 $ : int [1:19] 145 147 149 219 224 242 261 393 402 407 ...
 $ : int [1:5] 148 479 635 687 701
 $ : int [1:11] 98 105 157 255 400 454 529 661 663 668 ...
 $ : int 235
 $ : int [1:2] 158 265
 $ : int [1:11] 93 117 384 385 386 415 477 629 643 655 ...
 $ : int [1:3] 348 594 652
 $ : int [1:7] 12 30 156 428 548 583 696
 $ : int [1:10] 90 384 385 386 398 415 460 643 695 757
 $ : int [1:17] 95 106 137 165 166 344 383 396 404 412 ...
 $ : int [1:13] 94 106 112 137 145 166 383 396 412 442 ...
 $ : int [1:4] 151 229 424 683
 $ : int [1:18] 80 143 152 165 225 231 250 264 418 440 ...
 $ : int [1:12] 87 113 222 244 248 255 267 400 446 449 ...
 $ : int [1:35] 84 128 140 153 154 217 232 233 258 261 ...
  [list output truncated]
 - attr(*, "class")= chr "nb"
 - attr(*, "region.id")= chr [1:761] "1" "2" "3" "4" ...
 - attr(*, "call")= language dnearneigh(x = coords, d1 = 0, d2 = 72200)
 - attr(*, "dnn")= num [1:2] 0 72200
 - attr(*, "bounds")= chr [1:2] "GE" "LE"
 - attr(*, "nbtype")= chr "distance"
 - attr(*, "sym")= logi TRUE

We can observe that each region has varying number of neighbours.

Show the code
par(mfrow = c(1,2))
plot(nigeria$geometry, border = "lightgrey",main="1st nearest neighbours" )
plot(k1, coords, add = TRUE, col = "red", length = 0.88, )

plot(nigeria$geometry, border = "lightgrey", main = "Distance Link")
plot(wm_d72, coords, add = TRUE, pch = 19, cex = 0.6)

Due to a high number of links, we have very dense graphs which make it difficult to interpret. However, we can still make some observations:

  • The above charts actually illustrates a characteristic of fixed distance weight matrix–more densely settled areas (usually the urban areas) tend to have more neighbours and the less densely settled areas (usually the rural counties) tend to have lesser neighbours.

  • Based on the above charts, we can tell that the geographical areas of the regions in Nigeria are largely varying. In the top and bottom of the charts, we see the neighbour links are very dense and less dense in the western and eastern regions (where you can see pockets of white space).

5.1.3.2 Adaptive Distance-based Weight Matrix

To overcome the issue of fixed distance weight matrix where there is uneven distribution of neighbours, we can use directly control the numbers of neighbours using k-nearest neighbours, as shown in the code chunk below.

As a rule-of-thumb, we will set k = 8 i.e., all regions will have 8 neighbours.

Show the code
knn8 <- knn2nb(knearneigh(coords, k=8))
knn8
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 6088 
Percentage nonzero weights: 1.051248 
Average number of links: 8 
Non-symmetric neighbours list

Plotting Adaptive Distance-based Neighbours

Show the code
par(mfrow = c(1,2))
plot(nigeria$geometry, border = "lightgrey",main="8 nearest neighbours" )
plot(knn8, coords, add = TRUE, col = "red", length = 0.88, )

plot(nigeria$geometry, border = "lightgrey", main = "Distance Link w KNN")
plot(knn8, coords, add = TRUE, pch = 19, cex = 0.6)

5.1.4 Which spatial weight matrix to use?

Selecting a spatial weight matrix is use is dependent on the geographical area of interest and the focus of the study8. In our case, between contiguity-based and distance-based spatial weight matrices, we lean towards distance-based matrices. Within distance-based matrices, we will select the adaptive distance-based spatial weight matrix for our subsequent analysis.

The reasons are summarised here:

  • Nigeria has 761 LGAs with varying sizes. Hence, a contiguity-based matrix will have the issue where larger LGAs have more neighbours and smaller LGAs have lesser neighbours. This would likely skew our analysis. Therefore, distance-based methods are preferred.

  • As mentioned earlier, the fixed distance-based method has the disadvantage that some regions would only have 1 neighbour, while on average regions have 23 neighbours. Statistical test for regions with only 1 neighbour may not be valid.

Based on the above, we will select adaptive distance-based spatial weight matrix.

5.1.5 Row-Standardised Weights Matrix

After selecting the weight matrix to use, we will now assign weights to each neighboring polygon. Each neighboring polygon will be assigned equal weight (style=“W”) by assigning the fraction 1/(#of neighbors) to each neighbouring area. This is also known as a row-standardised matrix where each row in the matrix sums to 1.

Show the code
rswm_knn8 <- nb2listw(knn8,
                   style = "W",
                   zero.policy = TRUE)
rswm_knn8
Characteristics of weights list object:
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 6088 
Percentage nonzero weights: 1.051248 
Average number of links: 8 
Non-symmetric neighbours list

Weights style: W 
Weights constants summary:
    n     nn  S0       S1       S2
W 761 579121 761 171.0938 3102.562

We will be using the row-standardised weight matrix for the next part of the analysis.

5.1.6 Computing Global Spatial Autocorrelation Statistics

This in sub-section, we will use two methods: Moran’s I and Geary’s C to test the hypothesis the following hypothesis:

  • H0: Observed spatial patterns of values is equally likely as any other spatial pattern i.e. data is randomly disbursed, no spatial pattern

  • H1: Data is more spatially clustered than expected by chance alone.

5.1.6.1 Moran’s I

We will perform Moran’s I statistical testing by using moran.test() of spdep. Moran’s I describe how features differ from the values in the study area as a whole. The Moran I statistic ranges from -1 to 1. If the Moran I is:

  • positive (I>0): Clustered, observations tend to be similar

  • negative (I<0): Disperse, observations tend to be dissimilar

  • approximately zero: observations arranged randomly over space

The below code chunk will perform the Moran’s I test on both functional and non-functional water points.

Show the code
moran.test(nigeria$pct_functional,
           listw = rswm_knn8,
           zero.policy = TRUE,
           na.action = na.omit)

    Moran I test under randomisation

data:  nigeria$pct_functional  
weights: rswm_knn8    

Moran I statistic standard deviate = 30.692, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.523237285      -0.001315789       0.000292099 
Show the code
moran.test(nigeria$pct_nonfunctional,
           listw = rswm_knn8,
           zero.policy = TRUE,
           na.action = na.omit)

    Moran I test under randomisation

data:  nigeria$pct_nonfunctional  
weights: rswm_knn8    

Moran I statistic standard deviate = 26.388, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.4496039329     -0.0013157895      0.0002919996 

In both cases, since the p-value < 0.05, we have sufficient statistical evidence to reject the null hypothesis at the 95% level of confidence. This means that data is more spatially clustered than expected by chance alone. Since Moran I statistics are larger than 0, the observation are clustered, observations tend to be similar.

Computing Monte Carlo Moran’s I

If we have doubts that the assumptions of Moran’s I are true (normality and randomisation), we can use a Monte Carlo simulation to perform a permutation test for Moran’s I.

The permutation tests consists of randomly reassigning the attribute values to a cell under the assumption of no spatial pattern. This random assignment is conducted n times. Each time, we will compute the Moran’s I to creating an empirical distribution of Moran’s I under H0.

The code chunk below performs permutation test for Moran’s I statistic by using moran.mc() of spdep. A total of 1000 simulation will be performed.

Show the code
set.seed(1234)
bperm_func = moran.mc(nigeria$pct_functional, 
         listw = rswm_knn8,
         nsim = 999,
         zero.policy = TRUE,
         na.action = na.omit)
bperm_func

    Monte-Carlo simulation of Moran I

data:  nigeria$pct_functional 
weights: rswm_knn8  
number of simulations + 1: 1000 

statistic = 0.52324, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater
Show the code
set.seed(1234)
bperm_nonfunc = moran.mc(nigeria$pct_nonfunctional, 
         listw = rswm_knn8,
         nsim = 999,
         zero.policy = TRUE,
         na.action = na.omit)
bperm_nonfunc

    Monte-Carlo simulation of Moran I

data:  nigeria$pct_nonfunctional 
weights: rswm_knn8  
number of simulations + 1: 1000 

statistic = 0.4496, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater

In both cases, since the pseudo p-value is < 0.05, we have sufficient statistical evidence to reject the null hypothesis at the 95% level of confidence. This means that data is more spatially clustered than expected by chance alone for both functional and non-functional water points.

Visualising Monte Carlo Moran’s I

We can examine the simulated Moran’s I test statistics in greater detail through plotting the distribution of the statistical values as a histogram by using the code chunks below.

Let’s visualise Monte Carlo Moran’s I using a histogram. This can be created by using ggplot2 package.

Show the code
df <- as.data.frame(bperm_func$res)
colnames(df) <- c("Simulated Moran's I")

moran_mc_func <- ggplot(df, aes(x=`Simulated Moran's I`)) + 
  geom_histogram(color = "black", fill = "grey", bins = 25) +
  xlim(-0.1,0.6) + 
  ylab('Frequency') + 
  geom_vline(xintercept = 0, color = 'red') +
  geom_vline(xintercept = 0.523 , color = 'blue') + 
  ggtitle("Histogram of Monte Carlo Simulated \nMoran's I (Functional WP)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5)) +
  annotate("text", x = 0.35, y = 410, label = "Actual Moran's I", color = 'blue') 

df <- as.data.frame(bperm_nonfunc$res)
colnames(df) <- c("Simulated Moran's I")

moran_mc_nonfunc <- ggplot(df, aes(x=`Simulated Moran's I`)) + 
  geom_histogram(color = "black", fill = "grey", bins = 25) +
  xlim(-0.1,0.6) + 
  ylab('Frequency') + 
  geom_vline(xintercept = 0, color = 'red') +
  geom_vline(xintercept = 0.450 , color = 'blue') + 
  ggtitle("Histogram of Monte Carlo Simulated \nMoran's I (Non-Functional WP)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5)) +
  annotate("text", x = 0.28, y = 410, label = "Actual Moran's I", color = 'blue') 

moran_mc_func + moran_mc_nonfunc

In both cases, the actual Moran’s I value (blue line) is near the extremes of the distribution of the simulated data. This suggests a statistically significant relationship and evidence of positive autocorrelation i.e. cluster9.

5.1.6.2 Geary’s C

Geary’s C considers the difference between respective observations10--this means that it describe how features differ from their immediate neighbours. Geary’s C range from -1 to an undefined number above 1. If the Geary’s C is:

  • Large (c>1): Dispersed, observations tend to be dissimilar
  • Small (c<1): Clustered, observations tend to be similar
  • c = 1: observations arranged randomly over space

The code chunk below performs Geary’s C test for spatial autocorrelation for functional and non-functional water points using geary.test() of spdep.

Show the code
geary.test(nigeria$pct_functional, listw = rswm_knn8)

    Geary C test under randomisation

data:  nigeria$pct_functional 
weights: rswm_knn8 

Geary C statistic standard deviate = 29.377, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
      0.470163795       1.000000000       0.000325287 
Show the code
geary.test(nigeria$pct_nonfunctional, listw = rswm_knn8)

    Geary C test under randomisation

data:  nigeria$pct_nonfunctional 
weights: rswm_knn8 

Geary C statistic standard deviate = 25.29, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
     0.5393883718      1.0000000000      0.0003317272 

In both cases, since the p-value < 0.05, we have sufficient statistical evidence to reject the null hypothesis at the 95% level of confidence. This means that data is more spatially clustered than expected by chance alone. The Geary C statistics are less than 1 suggesting that clusters are present in both cases. This finding is consistent with the results of the Global Moran’s I test in the previous section.

Computing Monte Carlo Geary’s C

Similar to Moran’s I, we can also use Monte Carlo simulation to perform a permutation test for Geary’s C. The code chunk below performs permutation test for Geary’s C statistic by using geary.mc() of spdep. A total of 1000 simulation will be performed.

Show the code
set.seed(1234)
bperm_func = geary.mc(nigeria$pct_functional,
                 listw = rswm_knn8,
                 nsim = 999)
bperm_func

    Monte-Carlo simulation of Geary C

data:  nigeria$pct_functional 
weights: rswm_knn8 
number of simulations + 1: 1000 

statistic = 0.47016, observed rank = 1, p-value = 0.001
alternative hypothesis: greater
Show the code
set.seed(1234)
bperm_nonfunc = geary.mc(nigeria$pct_nonfunctional,
                 listw = rswm_knn8,
                 nsim = 999)
bperm_nonfunc

    Monte-Carlo simulation of Geary C

data:  nigeria$pct_nonfunctional 
weights: rswm_knn8 
number of simulations + 1: 1000 

statistic = 0.53939, observed rank = 1, p-value = 0.001
alternative hypothesis: greater

Since the pseudo p-value = 0.001 < 0.05, we have sufficient statistical evidence to reject the null hypothesis at the 95% level of confidence. This means that data is more spatially clustered than expected by chance alone.

Visualising the Monte Carlo Geary’s C

We can examine the simulated Geary’s C test statistics in greater detail through plotting the distribution of the statistical values as a histogram by using the code chunks below. This can be created by using ggplot2 package.

Show the code
df <- as.data.frame(bperm_func$res)
colnames(df) <- c("Simulated Geary's C")

geary_mc_func <- ggplot(df, aes(x=`Simulated Geary's C`)) + 
  geom_histogram(color = "black", fill = "grey", bins = 25) +
  xlim(0.4,1.2) + 
  ylab('Frequency') + 
  geom_vline(xintercept = 0, color = 'red') +
  geom_vline(xintercept = 0.470 , color = 'blue') + 
  ggtitle("Histogram of Monte Carlo Simulated \nGeary's C (Non-Functional WP)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5)) +
  annotate("text", x = 0.30, y = 410, label = "Actual Geary's C", color = 'blue') 


df <- as.data.frame(bperm_nonfunc$res)
colnames(df) <- c("Simulated Geary's C")

geary_mc_nonfunc <- ggplot(df, aes(x=`Simulated Geary's C`)) + 
  geom_histogram(color = "black", fill = "grey", bins = 25) +
  xlim(0.4,1.2) + 
  ylab('Frequency') + 
  geom_vline(xintercept = 0, color = 'red') +
  geom_vline(xintercept = 0.540 , color = 'blue') + 
  ggtitle("Histogram of Monte Carlo Simulated \nGeary's C (Non-Functional WP)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5)) +
  annotate("text", x = 0.37, y = 410, label = "Actual Geary's C", color = 'blue') 

geary_mc_func + geary_mc_nonfunc

In both cases, the actual Geary’s C value (blue line) is near the extremes of the distribution of the simulated data. This suggests a statistically significant relationship and evidence of positive autocorrelation i.e. cluster.

5.1.7 Spatial Correlogram

Spatial correlograms are great to examine patterns of spatial autocorrelation in your data or model residuals. They show how correlated are pairs of spatial observations when you increase the distance (lag) between them - they are plots of some index of autocorrelation (Moran’s I or Geary’s c) against distance.

In spatial correlograms, the number of bins determines the distance range of each bin. The range is the maximum distance divided by the number of bins11.

5.1.7.1 Moran’s I Correlogram

In the code chunk below, sp.correlogram() of spdep package is used to compute a 8-lag spatial correlogram for Moran’s I and the autocorrelation coefficient. The plot() of base Graph is then used to plot the output.

Show the code
moran_corr_func <- sp.correlogram(knn8, #non-weighted spatial weights
                          nigeria$pct_functional,
                          order = 8,
                          method = 'I', #Moran's I
                          style = 'W') #weighed
plot(moran_corr_func)

Show the code
moran_corr_nonfunc <- sp.correlogram(knn8, #non-weighted spatial weights
                          nigeria$pct_nonfunctional,
                          order = 8,
                          method = 'I', #Moran's I
                          style = 'W') #weighed
plot(moran_corr_nonfunc)

We can use the following code chunk if we are interested to find out the distance range of each bin / lag.

Show the code
nb8 <- nblag(knn8, 8)
correlogram_bins <- sapply(nb8, 
                           function(x) mean(unlist(nbdists(x, coords))))
correlogram_bins
[1]  37386.50  73925.69 112081.23 152528.12 194617.08 238153.84 283553.45
[8] 329774.69

Next, let’s examine the full analysis report and view which values are statistically significant.

Show the code
print(moran_corr_func)
Spatial correlogram for nigeria$pct_functional 
method: Moran's I
           estimate expectation    variance standard deviate Pr(I) two sided
1 (761)  5.2324e-01 -1.3158e-03  2.9210e-04           30.692       < 2.2e-16
2 (761)  4.3338e-01 -1.3158e-03  1.4782e-04           35.754       < 2.2e-16
3 (761)  3.6793e-01 -1.3158e-03  1.0116e-04           36.713       < 2.2e-16
4 (761)  3.2705e-01 -1.3158e-03  7.5025e-05           37.910       < 2.2e-16
5 (761)  2.8922e-01 -1.3158e-03  5.8720e-05           37.914       < 2.2e-16
6 (761)  2.4635e-01 -1.3158e-03  4.8735e-05           35.478       < 2.2e-16
7 (761)  2.0785e-01 -1.3158e-03  4.2396e-05           32.124       < 2.2e-16
8 (761)  1.7834e-01 -1.3158e-03  3.7887e-05           29.188       < 2.2e-16
           
1 (761) ***
2 (761) ***
3 (761) ***
4 (761) ***
5 (761) ***
6 (761) ***
7 (761) ***
8 (761) ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
print(moran_corr_nonfunc)
Spatial correlogram for nigeria$pct_nonfunctional 
method: Moran's I
           estimate expectation    variance standard deviate Pr(I) two sided
1 (761)  4.4960e-01 -1.3158e-03  2.9200e-04          26.3881       < 2.2e-16
2 (761)  3.1842e-01 -1.3158e-03  1.4777e-04          26.3026       < 2.2e-16
3 (761)  2.3165e-01 -1.3158e-03  1.0112e-04          23.1671       < 2.2e-16
4 (761)  1.5268e-01 -1.3158e-03  7.5000e-05          17.7818       < 2.2e-16
5 (761)  7.5056e-02 -1.3158e-03  5.8701e-05           9.9681       < 2.2e-16
6 (761)  3.9018e-02 -1.3158e-03  4.8719e-05           5.7786       7.532e-09
7 (761)  3.1312e-02 -1.3158e-03  4.2382e-05           5.0118       5.393e-07
8 (761)  2.2785e-02 -1.3158e-03  3.7874e-05           3.9161       8.997e-05
           
1 (761) ***
2 (761) ***
3 (761) ***
4 (761) ***
5 (761) ***
6 (761) ***
7 (761) ***
8 (761) ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output, we see that all results in both cases are statistically significant at the 95% level of confidence. The correlogram depicts how the spatial autocorrelation changes with distance. From the chart, we can see that Moran’s I decreases when spatial lag increases. This means that there is a quite strong spatial autocorrelation that decreases as spatial lag increases.

Comparing functional and non-functional water points, we can see that in the latter, from lag-5 onwards, the spatial autocorrelation < 0.1 and nears 0. In the case of functional water points, the Moran’s I value at lag 5 is around 0.3.

5.1.7.2 Geary’s C Correlogram

Similarly, we can do the same for Geary’s C.

Show the code
geary_corr_func <- sp.correlogram(knn8, #non-weighted spatial weights
                          nigeria$pct_functional,
                          order = 8,
                          method = 'C', #Geary's C
                          style = 'W') #weighed
plot(geary_corr_func)

Show the code
geary_corr_nonfunc <- sp.correlogram(knn8, #non-weighted spatial weights
                          nigeria$pct_nonfunctional,
                          order = 8,
                          method = 'C', #Geary's C
                          style = 'W') #weighed
plot(geary_corr_nonfunc)

Let’s use the code chunk below to print the full analysis report.

Show the code
print(geary_corr_func)
Spatial correlogram for nigeria$pct_functional 
method: Geary's C
          estimate expectation   variance standard deviate Pr(I) two sided    
1 (761) 4.7016e-01  1.0000e+00 3.2529e-04          -29.377       < 2.2e-16 ***
2 (761) 5.7163e-01  1.0000e+00 1.9203e-04          -30.912       < 2.2e-16 ***
3 (761) 6.3944e-01  1.0000e+00 1.5802e-04          -28.682       < 2.2e-16 ***
4 (761) 6.7225e-01  1.0000e+00 1.2514e-04          -29.299       < 2.2e-16 ***
5 (761) 7.1004e-01  1.0000e+00 1.0807e-04          -27.891       < 2.2e-16 ***
6 (761) 7.3684e-01  1.0000e+00 9.8181e-05          -26.558       < 2.2e-16 ***
7 (761) 7.6512e-01  1.0000e+00 9.8474e-05          -23.669       < 2.2e-16 ***
8 (761) 7.8220e-01  1.0000e+00 1.0972e-04          -20.792       < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
print(geary_corr_nonfunc)
Spatial correlogram for nigeria$pct_nonfunctional 
method: Geary's C
          estimate expectation   variance standard deviate Pr(I) two sided    
1 (761) 0.53938837  1.00000000 0.00033173         -25.2897       < 2.2e-16 ***
2 (761) 0.67620379  1.00000000 0.00020058         -22.8630       < 2.2e-16 ***
3 (761) 0.74920671  1.00000000 0.00016899         -19.2922       < 2.2e-16 ***
4 (761) 0.82171931  1.00000000 0.00013480         -15.3552       < 2.2e-16 ***
5 (761) 0.89720975  1.00000000 0.00011759          -9.4791       < 2.2e-16 ***
6 (761) 0.92911305  1.00000000 0.00010771          -6.8302       8.479e-12 ***
7 (761) 0.95078872  1.00000000 0.00010928          -4.7075       2.507e-06 ***
8 (761) 0.97228216  1.00000000 0.00012356          -2.4935         0.01265 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output, we see that all results are statistically significant at the 95% level of confidence for both cases.

The correlogram depicts how the spatial autocorrelation changes with distance. From the chart, we can see that Geary’s C increases when spatial lag increases. This is unsurprising, given that Moran’s I and Geary’s C are inversely related. Also, all Geary C’s values are < 1 which suggests clustering.

In the case of non-functional water points, from Lag 5, the values are >= 0.9, nearing 1 which suggests little to no cluster or dissimilar observations. On the other hand, the values for functional water points are <= 0.8 for all lags.

5.2 Local Spatial Autocorrelation Statistics

In the previous section, we have established through statistical testing that spatial clustering of non-functional water points occur in Nigeria. Now, we seek to detect clusters or outliers and discover if there are any hot or cold spots of non-functional water points using Local Spatial Autocorrelation Statistics. They include: Anselin’s Moran Scatterplot and Local Indicators of Spatial Autocorrelation (LISA) and Getis-Ord Gi Statistics.

5.2.1 Cluster and Outlier Analysis

5.2.1.1 Local Moran’s I

The Local Moran’s I is a local spatial autocorrelation statistic based on the Moran’s I statistic. It was developed by Anselin (1995) as a LISA statistic. Anselin defines LISA to have two properties:

  1. “The LISA for each observation gives an indication of the extent of significant spatial clustering of similar values around that observation”; and

  2. “the sum of LISAs for all observations is proportional to a global indicator of spatial association.”

The first point means that given an attribute of interest, for each region in Nigeria, the LISA indicates existence and extent of spatial clustering of regions with similar attribute values12.

Positive Local Moran’s I value indicates that a feature has neighboring features with similarly high or low attribute values; this feature is part of a cluster. Negative Local Moran’s I value indicates that a feature has neighboring features with dissimilar values; this feature is an outlier13.

Computing Local Moran’s I

Before we can map the values, we first need to compute them using the localmoran() function of spdep package. It computes Ii values, given a set of zi values and a listw object providing neighbour weighting information for the polygon associated with the zi values.

The code chunk below is used to compute the Local Moran’s I of percentage of functional and non-functional water points at the LGA level.

Show the code
fips <- order(nigeria$LGA)
localMI_func <- localmoran(nigeria$pct_functional, rswm_knn8)
head(localMI_func)
           Ii          E.Ii      Var.Ii       Z.Ii Pr(z != E(Ii))
1  0.37995419 -2.159064e-04 0.020344291  2.6653649    0.007690482
2  0.38571528 -2.312021e-04 0.021785225  2.6148457    0.008926777
3 -0.07290031 -2.549836e-04 0.024025489 -0.4686747    0.639302139
4  0.05542823 -1.839603e-05 0.001733752  1.3316242    0.182983719
5  0.47953038 -5.726499e-04 0.053940028  2.0671827    0.038716939
6  0.12006509 -3.150906e-05 0.002969563  2.2038623    0.027534029
Show the code
fips <- order(nigeria$LGA)
localMI_nonfunc <- localmoran(nigeria$pct_nonfunctional, rswm_knn8)
head(localMI_nonfunc)
          Ii          E.Ii     Var.Ii      Z.Ii Pr(z != E(Ii))
1  0.6506454 -0.0008294518 0.07810908  2.331025    0.019752051
2  0.5944165 -0.0005017373 0.04726386  2.736483    0.006209989
3  0.1855012 -0.0016465926 0.15493203  0.475460    0.634459103
4  0.8661740 -0.0007449612 0.07015857  3.272941    0.001064347
5  0.6411397 -0.0010566011 0.09947700  2.036134    0.041736858
6 -0.1639149 -0.0001770915 0.01668751 -1.267515    0.204971134

localmoran() function returns a matrix of values whose columns are:

  • Ii: the local Moran’s I statistics

  • E.Ii: the expectation of local moran statistic under the randomisation hypothesis

  • Var.Ii: the variance of local moran statistic under the randomisation hypothesis

  • Z.Ii:the standard deviate of local moran statistic

  • Pr(): the p-value of local moran statistic

The code chunk below list the content of the local Moran matrix derived by using printCoefmat().

Show the code
printCoefmat(data.frame(
  localMI_func[fips,], 
  row.names=nigeria$LGA[fips]),
  check.names=FALSE)
Show the code
printCoefmat(data.frame(
  localMI_nonfunc[fips,], 
  row.names=nigeria$LGA[fips]),
  check.names=FALSE)

Mapping Local Moran’s I values and p-values

Next, we will append the local Moran’s I dataframe (i.e. localMI_func and localMI_nonfunc) onto nigeria SpatialPolygonDataFrame in preparation for the next part. This can be done using the code chunks below.

Show the code
nigeria.localMI_func <- cbind(nigeria,localMI_func) %>%
  rename(Pr.Ii = Pr.z....E.Ii..)
Show the code
nigeria.localMI_nonfunc <- cbind(nigeria,localMI_nonfunc) %>%
  rename(Pr.Ii = Pr.z....E.Ii..)

We must also consider the p-values for each of the Local Moran’s I values before interpreting them. Let’s visualise both the Local Moran’s I values and its p-values using the choropleth mapping functions of tmap package

Show the code
localMI_func.map <- tm_shape(nigeria.localMI_func) +
  tm_fill(col = "Ii", 
          style = "pretty",
          title = "Local Moran I Statistics") +
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Local Moran's I Map \n(Functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

pvalue_func.map <- tm_shape(nigeria.localMI_func) + 
                tm_fill(col = "Pr.Ii",
                       breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
                       palette = "-Blues",
                       title = "Local Moran's I p-values") + 
                tm_borders(alpha = 0.3)+ 
  tm_layout(main.title = "Local Moran's I p-values Map \n(Functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(localMI_func.map, pvalue_func.map, asp = 1, ncol = 2)

Show the code
localMI_nonfunc.map <- tm_shape(nigeria.localMI_nonfunc) +
  tm_fill(col = "Ii", 
          style = "pretty",
          title = "Local Moran I Statistics") +
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Local Moran's I Map \n(Non-Functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

pvalue_nonfunc.map <- tm_shape(nigeria.localMI_nonfunc) + 
                tm_fill(col = "Pr.Ii",
                       breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
                       palette = "-Blues",
                       title = "Local Moran's I p-values") + 
                tm_borders(alpha = 0.3)+ 
  tm_layout(main.title = "Local Moran's I p-values Map \n(Non-Functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(localMI_nonfunc.map, pvalue_nonfunc.map, asp = 1, ncol = 2)

Something extra that we can also do is to build a choropleth and shade only regions that are statistically significant. This can be done using the code chunk below. We first create a new object nigeria.localMI_sig_f that consists of statistically significant values for functional water points. Then we plot a base map that consists of just the polygons features. Lastly, we overlay the base map with a new map that consists of the statistically significant local Moran’s I value. We repeat this for non-functional water points.

Show the code
#Functional
nigeria.localMI_sig_f <- cbind(nigeria,localMI_func) %>%
  rename(Pr.Ii = Pr.z....E.Ii..) %>%
  filter(Pr.Ii < 0.05)

base <- tm_shape(nigeria) + 
  tm_fill(col = 'gray98') + 
  tm_borders(alpha = 0.3)

localMI_sig_f.map <- base + 
  tm_shape(nigeria.localMI_sig_f) +
  tm_fill(col = "Ii", 
          style = "pretty",
          title = "Local Moran I Statistics") +
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Local Moran's I (Sig.) Map \n(Functional WP) ",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

#Non-Functional
nigeria.localMI_sig_nf <- cbind(nigeria,localMI_nonfunc) %>%
  rename(Pr.Ii = Pr.z....E.Ii..) %>%
  filter(Pr.Ii < 0.05)

base <- tm_shape(nigeria) + 
  tm_fill(col = 'gray98') + 
  tm_borders(alpha = 0.3)

localMI_sig_nf.map <- base + 
  tm_shape(nigeria.localMI_sig_nf) +
  tm_fill(col = "Ii", 
          style = "pretty",
          title = "Local Moran I Statistics") +
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Local Moran's I (Sig.) Map \n(Non-Functional WP) ",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(localMI_sig_f.map,localMI_sig_nf.map)

Recall that a positive Local Moran’s I value indicates that a feature is part of a cluster and a negative Local Moran’s I value indicates that a feature is an outlier.

For the functional water points map, we see most regions are in shades of green which corresponds to them being in cluster(s). This is similar for the non-functional water points map. There are areas of overlap on both maps (top) and also areas where both maps differ (bottom of the map). Later on, we will combine this analysis with LISA cluster map to derive more information.

5.2.1.2 Anselin’s Moran Scatterplot

The Anselin’s Moran Scatterplot allows us to assess how similar or dissimilar an observed value is to it’s neighbouring observations. It is a visualisation tool that gives us a visual representation of spatial associations in the neighbourhood around each observation14.

Standardising Variable

Before we plot the scatterplot, we can also standardise the variable first. This will result in the scatterplot to be centered on the coordinates (0,0) which may be easier to interpret.

We can use scale() to center and scale the variable. Centering is done by subtracting the mean from the corresponding columns, and scaling is done by dividing the (centered) variable by their standard deviations.

After running the code chunk below, a new column Z.pct_nonfunctional will be created to store the standardised values. The as.vector() added to the end is to make sure that the data type we get out of this is a vector, that map neatly into out dataframe.

Show the code
nigeria$Z.pct_functional <- scale(nigeria$pct_functional) %>%
  as.vector
nigeria$Z.pct_nonfunctional <- scale(nigeria$pct_nonfunctional) %>%
  as.vector

head(nigeria,3) 
Simple feature collection with 3 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 481557.7 ymin: 116934.8 xmax: 552734.5 ymax: 579396
Projected CRS: Minna / Nigeria West Belt
        LGA Level                    shapeID shapeGroup shapeType
1 Aba North  ADM2 NGA-ADM2-72505758B79815894        NGA      ADM2
2 Aba South  ADM2 NGA-ADM2-72505758B67905963        NGA      ADM2
3     Abaji  ADM2 NGA-ADM2-72505758B61968000        NGA      ADM2
                        geometry wpt_functional wpt_nonfunctional wpt_unknown
1 MULTIPOLYGON (((552560.3 12...              7                 9           1
2 MULTIPOLYGON (((545153.6 12...             29                35           7
3 MULTIPOLYGON (((510602.3 57...             23                34           0
  wpt_total pct_functional pct_nonfunctional            Area   wpt_func_density
1        17      0.4117647         0.5294118  18723031 [m^2] 0.37387109 [1/m^2]
2        71      0.4084507         0.4929577  43383884 [m^2] 0.66845099 [1/m^2]
3        57      0.4035088         0.5964912 832968699 [m^2] 0.02761208 [1/m^2]
  wpt_nonfunc_density Z.pct_functional Z.pct_nonfunctional
1  0.48069141 [1/m^2]       -0.4048126           0.7934450
2  0.80675119 [1/m^2]       -0.4189065           0.6171056
3  0.04081786 [1/m^2]       -0.4399237           1.1179293

In the output above, we can see the two new columns added.

Now, we can plot the Moran scatterplot using the standardised values and moran.plot() of spdep.

Show the code
nci_func <- moran.plot(nigeria$Z.pct_functional, rswm_knn8,
                  labels = as.character(nigeria$LGA),
                  xlab = "Pct Non-Functional Water Points",
                  ylab = "Spatially Lag Pct Non-Functional Water Points")

Show the code
nci_nonfunc <- moran.plot(nigeria$Z.pct_nonfunctional, rswm_knn8,
                  labels = as.character(nigeria$LGA),
                  xlab = "Pct Non-Functional Water Points",
                  ylab = "Spatially Lag Pct Non-Functional Water Points")

In both cases, we will find that the output of the Moran Scatterplots to be consistent with the Moran’s I scores tabulated in the previous sub-section. The Moran scatterplot has 4 quadrants:

  • High-High” - Top right corner : Positive Autocorrelation Cluster. Regions here have high percentage of non-functional water points and are surrounded by other areas that have the higher than average level of non-functional water points.

  • “Low-High” - Top left corner: Negative Autocorrelation Cluster. Regions here have low percentage of non-functional water points and are surrounded by other areas that have the higher than average level of non-functional water points.

  • “High-Low” - Bottom right corner: Negative Autocorrelation Cluster. Regions here have high percentage of non-functional water points and are surrounded by other areas that have the lower than average level of non-functional water points.

  • Low-Low” - Bottom left corner : Positive Autocorrelation Cluster. Regions here have low percentage of non-functional water points and are surrounded by other areas that have the lower than average level of non-functional water points.

Unfortunately, the Moran scatterplot does not tells us which points/regions are significant. To overcome this, we can use the LISA Cluster Maps.

5.2.1.3 LISA Cluster Maps

LISA Cluster Maps also categorises each region into one of five groups: (1) High-High, (2) High-Low, (3) Low-High, (4) Low-Low and (5) Insignificant.

We can prepare a LISA cluster map by following these steps:

  1. Create a vector of the same length as the number of LGA in Nigeria
  2. Derive a variable, DV by using a by using the spatially lagged version (lag_GDPPC) of the variable of interested (GDPPC) and center it around its means. When DV > 0, the spatially lagged variable of the region is higher than the mean.
  3. Derive a variable, L_MI using the Local Moran’s I.
  4. Set the significance level for the local Moran.
  5. Define the command lines for: high-high, low-low, low-high, high-low
  6. Place statistically insignificant Moran I in the category 0.
Show the code
#Step 1
quadrant <- vector(mode = 'numeric', length = nrow(localMI_func))
#Step 2
nigeria$lag_pct_func <- lag.listw(rswm_knn8, nigeria$pct_functional)
DV_func <- nigeria$lag_pct_func - mean(nigeria$lag_pct_func)     
#Step 3
LM_I_func <- localMI_func[,1] 
#Step 4
signif <- 0.05
#Step 5
quadrant[DV_func <0 & LM_I_func>0] <- 1 #low-low
quadrant[DV_func >0 & LM_I_func<0] <- 2 #high-low
quadrant[DV_func <0 & LM_I_func<0] <- 3 #low-high
quadrant[DV_func >0 & LM_I_func>0] <- 4 #high-high
#Step 6
quadrant[localMI_func[,5]>signif] <- 0

This is the code chunk to prepare the LISA cluster map.

Show the code
#Assign each region  to its respective quardrant
nigeria.localMI_func$quadrant <- quadrant
#Set the colours--one for each quadrant
colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c") 

clusters <- c("insignificant", "low-low", "low-high", "high-low", "high-high")

LISAmap_func <- tm_shape(nigeria.localMI_func) + 
  tm_fill(col = "quadrant",
          style = "cat", 
          palette = colors[c(sort(unique(quadrant)))+1],
          labels = clusters[c(sort(unique(quadrant)))+1])  + 
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "LISA Cluster Map \n (Functional WP) ",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

We will repeat the same steps for non-functional water points.

Show the code
#Step 1
quadrant <- vector(mode = 'numeric', length = nrow(localMI_nonfunc))
#Step 2
nigeria$lag_pct_nonfunc <- lag.listw(rswm_knn8, nigeria$pct_nonfunctional)
DV_nonfunc <- nigeria$lag_pct_nonfunc - mean(nigeria$lag_pct_nonfunc)     
#Step 3
LM_I_nonfunc <- localMI_nonfunc[,1] 
#Step 4
signif <- 0.05
#Step 5
quadrant[DV_nonfunc <0 & LM_I_nonfunc>0] <- 1 #low-low
quadrant[DV_nonfunc >0 & LM_I_nonfunc<0] <- 2 #high-low
quadrant[DV_nonfunc <0 & LM_I_nonfunc<0] <- 3 #low-high
quadrant[DV_nonfunc >0 & LM_I_nonfunc>0] <- 4 #high-high
#Step 6
quadrant[localMI_nonfunc[,5]>signif] <- 0

This is the code chunk to prepare the LISA cluster map.

Show the code
#Assign each region  to its respective quardrant
nigeria.localMI_nonfunc$quadrant <- quadrant
#Set the colours--one for each quadrant
colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c") 

clusters <- c("insignificant", "low-low", "low-high", "high-low", "high-high")

LISAmap_nonfunc <- tm_shape(nigeria.localMI_nonfunc) + 
  tm_fill(col = "quadrant",
          style = "cat", 
          palette = colors[c(sort(unique(quadrant)))+1],
          labels = clusters[c(sort(unique(quadrant)))+1])  + 
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "LISA Cluster Map \n (Non-Functional WP) ",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

We can plot the local Moran’s I map (statistically significant values only) and LISA map together. What we will observe is that the shaded regions are the same for each pair of maps.

Show the code
tmap_arrange(localMI_sig_f.map, LISAmap_func,
  localMI_sig_nf.map, LISAmap_nonfunc, ncol =2)

The LISA maps provides another level of information–which is whether the region have relatively higher or lower percentage of non-functional water points.

Functional: From the Local Moran’s I map, we can make out that regions have positive Local Moran’s I values, suggesting that they have neighbours with similarly high or low percentage of functional water points. This would be consistent with the dark blue (low-low) and red (high-high) regions in the LISA map. We can see that the top areas are in red while bottom are in blue.

Non-functional: From the Local Moran’s I map, we can make out that regions have positive Local Moran’s I values, suggesting that they have neighbours with similarly high or low percentage of non-functional water points. This would be consistent with the dark blue (low-low) and red (high-high) regions in the LISA map. We can see that the top areas are in blue while bottom are in red.

5.2.1.4 Interpretation of Results

Overall, Local Moran’s have revealed significant spatial cluster and outliers.

Functional Water points

For ease of comparison, here is the map of the percentage of functional water points and the LISA map. I have set the mode to allow for interactive viewing - this means that we can zoom in / out of the map and click on the LGAs too.

Show the code
tmap_mode("view")
tmap_arrange(
choropleth_plot(nigeria, "pct_functional", "pretty", "Proportion of Functional water points \nin Nigeria: partitioned by 'pretty' intervals"), 
             LISAmap_func)

At a glance, in the LISA map, we can see two larger clusters–a high-high (red) cluster situation near the top and a low-low (blue) cluster situation near the bottom of the map. We also notice no significant spatial clusters or outliers in the middle/central area of the map. It may be geographically related. The top area borders nearby countries like Niger and Chad while the bottom are coastal areas bordering the Gulf of Guinea.

There are four high-high clusters areas. The regions in this area and their neighbours have a high percentage of functional water points The largest one is at the northern portion of the map. In that area, there are pockets of low-high outliers (light blue)--meaning that these areas have a lower percentage of functional water points as compared to their neighbours and may be worth investigating. There is another high-high cluster on the top right. The LGAs sandwiched between the two high-high clusters may also be worth a deeper look.

There are around two larger low-low clusters (dark blue). The largest one is in the southern part of the country. There are also pockets of high-low outliers in the same parts. Interestingly, the cluster extends from the coastal areas and move towards the inner parts of the country in a C shape–the centre of the ‘C’ are not identified as cluster or outliers.

Non-Functional Water Points

Similarly, for ease of comparison, here is the map of the percentage of non-functional water points and the LISA map.

Show the code
tmap_mode("view")
tmap_arrange(
choropleth_plot(nigeria, "pct_nonfunctional", "pretty", "Proportion of Non-functional water points \nin Nigeria: partitioned by 'pretty' intervals"), 
             LISAmap_nonfunc, asp = 1, ncol = 2 )

We can see all categories of cluster/outlier classification on the LISA map. The first two observations we can make are that there are two large clusters–one low-low cluster situation near the top and a high-high cluster situation near the bottom of the map. We also notice no significant spatial clusters or outliers in the middle/central area of the map. It may be geographically related. The top area borders nearby countries like Niger and Chad while the bottom are coastal areas bordering the Gulf of Guinea.

There are around 6 low-low clusters (dark blue). The regions in this area and their neighbours have a low percentage of non-functional water points. The larger low-low cluster is relatively extensive. It extends from the top middle of the map to the top right corner and consists parts of states like Borno, Yobe, Dutse and Kano.

There are four high-high clusters (red). The regions in this area and their neighbours have a high percentage of non-functional water points The largest one extends from bottom, slightly-off centre of the map upwards. This includes LGAs in the states of Edo, Delta and Baylesa. Towards the right, there is another high-high cluster in the state of Cross River. The first and second clusters are relatively close to each other as compared to the remaining two. An interesting area to further study is Kwame and Fumakaye LGAs. They are high-high clusters situated near a large low-low cluster.

We see pockets of outliers throughout the country. There are two low-high outliers that are situated next to clusters. The regions have low percentage of non-functional water points while their neighbours have higher percentage of the same. We can see two outliers Bagudo and Fakai near the top left area of the map. The former is situated next to a high-high cluster. There is no low-high clusters towards the top and top right of the map, which is probably because the area has a large low-low spatial cluster. On the other hand, there are pockets of high-low outliers like Tudun Wada and Karaye LGAs.

States of Nigeria15

Comparing both maps, we can see a different in high-high and low-low clusters, it’s almost always inverse. Also, both maps have the similarity where the centre part of the map has no clusters or outliers identified.

5.2.2 Hot Spot Area Analysis

In the sub-section, we will use a local spatial autocorrelation statistic to detect hot or cold spots. Hot spot refers to areas that have higher values relative to its surroundings.

5.2.2.1 Getis and Ord’s G-Statistics

The Getis and Ord’s G-statistics (Getis and Ord, 1972; Ord and Getis, 1995) are alternative spatial statistics to detect spatial anomalies. It looks at each region within the context of its neighbouring features. A statistically significant hot spot will have high values and are surrounded by other areas with high values as well.

The Gi statistic is a z-score. For statistically significant and positive z-score, the higher the value, the more intense the clustering of high values is i.e. hot spot. Inversely, for a statistically significant negative z-score, the lower the value, the more intense the clustering of lower values is i.e. cold spot.

With reference to the context, we are hence looking for areas where the area itself and its neighbours have a high percentage of non-functional water points (hot spots) and areas where the areas itself and its neighbours have a lower percentage of non-functional water points.

The analysis consists of three steps:

  • Deriving spatial weight matrix

  • Computing Gi statistics

  • Mapping Gi statistic

Deriving spatial weights matrix

Getis-Ord defines neighbours based on distance. The Gi-statistics measures the degree of association that comes about from the concentration of weighted points and all other weighted points included within a radius of distance from the original point16. It included a weight component, which is a binary spatial weight matrix–where 1 represents a link.

Gi(d) Statistic, Getis, A., & Ord, K. (1992)

Therefore, we can use the previously derived knn8, and weight it with a binary weight matrix. This can be done using the code chunk here.

Show the code
bwm_knn8 <- nb2listw(knn8,
                   style = "B",
                   zero.policy = TRUE)
bwm_knn8
Characteristics of weights list object:
Neighbour list object:
Number of regions: 761 
Number of nonzero links: 6088 
Percentage nonzero weights: 1.051248 
Average number of links: 8 
Non-symmetric neighbours list

Weights style: B 
Weights constants summary:
    n     nn   S0    S1     S2
B 761 579121 6088 10950 198564

Computing Gi Statistics

To compute the Gi statistics, we can use localG() from spdep package.

Show the code
fips <- order(nigeria$LGA)
gi.adaptive_func <- localG(nigeria$pct_functional, bwm_knn8)
nigeria.gi_func <- cbind(nigeria, as.matrix(gi.adaptive_func)) %>%
  rename(gstat_adaptive = as.matrix.gi.adaptive_func.)
Show the code
fips <- order(nigeria$LGA)
gi.adaptive_nonfunc <- localG(nigeria$pct_nonfunctional, bwm_knn8)
nigeria.gi_nonfunc <- cbind(nigeria, as.matrix(gi.adaptive_nonfunc)) %>%
  rename(gstat_adaptive = as.matrix.gi.adaptive_nonfunc.)

Mapping GI Values with Adaptive Distance Weights

To visualise the locations of hot spot and cold spot areas, we can use the choropleth mapping functions of tmap package.

The below code chunk plots the map for functional water points.

Show the code
tmap_mode("plot")
Gimap_func <- tm_shape(nigeria.gi_func) + 
  tm_fill(col = "gstat_adaptive", 
          style = "pretty",
          palette = "-RdBu",
          title = "local Gi") + 
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Gi Map \n (Functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(choropleth_plot(nigeria, "pct_functional", "pretty", "Proportion of Functional water points \nin Nigeria: partitioned by 'pretty' intervals"), 
             Gimap_func) 

The below code chunk plots the map for non-functional water points

Show the code
tmap_mode("plot")
Gimap_nonfunc <- tm_shape(nigeria.gi_nonfunc) + 
  tm_fill(col = "gstat_adaptive", 
          style = "pretty",
          palette = "-RdBu",
          title = "local Gi") + 
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Gi Map \n (Non-Functional WP) ",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(choropleth_plot(nigeria, "pct_nonfunctional", "pretty", "Proportion of Non-functional water points \nin Nigeria: partitioned by 'pretty' intervals"), 
             Gimap_nonfunc) 

5.2.2.2 Interpretation of Results

Before we interpret the charts, we should only consider the statistically significant spots. We can refer to the localg()documentation for the critical values of the statistic for the 95th percentile. Since our data has close to 1,000 records, the critical value is +/-3.886. This means that the Gi Statistic is statistically significant if it is more than or less than 3.886.

We can use the code chunk below to filter out only the statistically significant spots.

Show the code
nigeria.gi_sig_f <- cbind(nigeria, as.matrix(gi.adaptive_func)) %>%
  rename(gstat_adaptive = as.matrix.gi.adaptive_func.) %>%
  filter(gstat_adaptive > 3.886 | gstat_adaptive < -3.886 )

nigeria.gi_sig_nf <- cbind(nigeria, as.matrix(gi.adaptive_nonfunc)) %>%
  rename(gstat_adaptive = as.matrix.gi.adaptive_nonfunc.) %>%
  filter(gstat_adaptive > 3.886 | gstat_adaptive < -3.886 )

To visualise the map with statistically significant Gi statistics, we can refer to the code chunk below. I have set the mode to allow for interactive viewing - this means that we can zoom in / out of the map and click on the LGAs too.

Functional Water Points

Show the code
tmap_mode("view")
base <- tm_shape(nigeria) + 
  tm_fill(col = 'gray98') + 
  tm_borders(alpha = 0.3)

Gimap_sig_f <- base + 
  tm_shape(nigeria.gi_sig_f) +
  tm_fill(col = "gstat_adaptive", 
          style = "pretty",
          palette = "-RdBu",
          title = "Local Gi") +
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Local Gi Statistic (Sig.) Map \n (Functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(choropleth_plot(nigeria, "pct_functional", "quantile", "Proportion of Functional water points \nin Nigeria: partitioned by 'pretty' intervals"), 
             Gimap_sig_f) 

The Gi statistics have revealed significant pockets of hot (red shades) spot areas and no cold spots. Right away, we observe that these areas are situated mostly in the northern regions. The identified of hot spots are less sparse than what we saw in the LISA map.

Looking at the hot spots, we see a relatively large hot spot areas at the top of the map. There are areas with higher intensity of functional water points, identified by the darker red shades and areas with less intensity in lighter red shades. This area borders the neighbouring country of Niger. It does seem that the intensity is higher in the outer areas before becoming less intense.

Non-functional Water Points

Show the code
tmap_mode("view")
base <- tm_shape(st_transform(nigeria)) + 
  tm_fill(col = 'gray98') + 
  tm_borders(alpha = 0.3)

Gimap_sig_nf <- base + 
  tm_shape(nigeria.gi_sig_nf) +
  tm_fill(col = "gstat_adaptive", 
          style = "pretty",
          palette = "-RdBu",
          title = "Local Gi") +
  tm_borders(alpha = 0.3) + 
  tm_layout(main.title = "Local Gi Statistic (Sig.) Map \n (Non-functional WP)",
            main.title.size = 1,
            main.title.position = "center",
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE)

tmap_arrange(choropleth_plot(nigeria, "pct_nonfunctional", "quantile", "Proportion of Non-functional water points \nin Nigeria: partitioned by 'pretty' intervals"), 
             Gimap_sig_nf) 

The Gi statistics have revealed significant pockets of hot (red shades) and cold (blue shades) spot areas. At first glance, we can see that there are mostly cold spots on the top half of the map and mostly hot spots on the bottom of the map. This pattern is consistent with the LISA map, although the extent of hot/cold areas are different.

On the top right area, we can see a cold spot. It includes Bama, Dikwa, Mafa and Monguno with more intense clustering (darker shade of blue). They are situated towards the outer areas / near borders. The area beyond the dark blue shaded region are LGAs without water points at all (no data recorded). This may signify the importance of these water points and probably and importance to ensure they are all functional. The lower percentage of non-functional water points could also be because of the lower number of water points in the area, which increase the importance of existing water points. Towards the left, we have Gumel and Maigatari, Miga and Jahun that are also cold spots.

Looking at the hot spots, we see some at the bottom right of the map. At the coastal border, we have more intense clustering (dark red), suggesting higher number of non-functional water points in that area. Towards the mainland, the clustering is less intense but with pockets of dark reds.

6 Conclusion

In our study of the functional and non-functional water points in Nigeria using ESDA, we utilised exploratory spatial data analysis methods and visualisation tools to generate insights about the situation–more that what we could have found out using only traditional choropleths. Using Global Spatial Autocorrelation Statistics, we found out that there is existence of spatial autocorrelation in both cases of functional and non-functional water points. This led to us using local spatial autocorrelation statistical methods like Local Moran’s, Anselin Moran Scatterplot, LISA cluster maps and Getis and Ord’s G-Statistics to under areas of clusters or outliers and hot or cold spots.

Our cluster and outlier analysis on functional water points found clusters in the northern part of the map to have higher proportion of functional water points and in the southern part of the map to have less proportion of the same. The same analysis on non-functional water point found clusters on the northern part of Nigeria to have less proportion of non-functional water points and the clusters in the southern part to have higher proportion of the same.

Our hot and cold spot analysis of functional water points generated one large hot spot at the northern region of the map. The same analysis on non-functional water points found areas of hot spot (higher proportion of non-functional water points) towards the southern portion of the map and some cold spots towards the northern parts.

The two analyses conducted using different statistical methods yield different results. Visually we can observe that the hot/cold spots identified in Gi Map were less sparse as compared to the LISA clusters map. A common observation is that in both, the middle / central parts of Nigeria were not identified as clusters, the northern part is identified as a cold spot and cluster and the southern part, a hot spot and cluster for non-functional water points.

Overall, we can see that the southern areas have higher proportion of non-functional water points and northern areas have higher proportion of functional water points. This may gives valuable insights to inform public policy and improve the water situation.

6.1 Future Works

We can explore the impact of living near the borders–either to the neighbouring countries or coastal areas. It would also be interesting to deep dive into location that are near rivers like the Niger or Benou.

It would also be interesting to bring in the water sources e.g., tap, well, etc., into the dataset and see if is any relationship between type of water source and its functional status. This applies for other demographic factors like population size, income level and education level.

Footnotes

  1. https://www.worldbank.org/en/topic/water/overview↩︎

  2. https://www.unwater.org/publications/un-world-water-development-report-2022↩︎

  3. https://www.waterpointdata.org/about/↩︎

  4. Okorie PN, Ademowo GO, Saka Y, Davies E, Okoronkwo C, Bockarie MJ, Molyneux DH, Kelly-Hope LA. Lymphatic filariasis in Nigeria; micro-stratification overlap mapping (MOM) as a prerequisite for cost-effective resource utilization in control and surveillance. PLoS Negl Trop Dis. 2013 Sep 5;7(9):e2416. doi: 10.1371/journal.pntd.0002416. PMID: 24040432; PMCID: PMC3764235.↩︎

  5. https://www.unicef.org/nigeria/press-releases/nearly-one-third-nigerian-children-do-not-have-enough-water-meet-their-daily-needs↩︎

  6. Mack, Vincent & Kam, Tin Seong. (2018). Is There Space for Violence?: A Data-driven Approach to the Exploration of Spatial-Temporal Dimensions of Conflict. 1-10. 10.1145/3282933.3282935.↩︎

  7. https://geodacenter.github.io/workbook/4a_contig_weights/lab4a.html#rook-contiguity)↩︎

  8. Chapter 2. Codifying the neighbourhood structure of Handbook of Spatial Analysis: Theory and Application with R.↩︎

  9. https://swampthingecology.org/blog/geospatial-data-analysis-in-rstats.-part-2/↩︎

  10. Chapter 2. Codifying the neighbourhood structure of Handbook of Spatial Analysis: Theory and Application with R.↩︎

  11. https://geodacenter.github.io/workbook/5a_global_auto/lab5a.html#spatial-correlogram↩︎

  12. http://ceadserv1.nku.edu/longa/geomed/ppa/doc/LocalI/LocalI.htm#:~:text=Local%20Moran’s%20I%20is%20a,spatial%20association%20or%20LISA%20statisti↩︎

  13. https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/h-how-cluster-and-outlier-analysis-anselin-local-m.htm↩︎

  14. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_variogram_details31.htm↩︎

  15. Salubi, E.A., Elliott, S.J. Geospatial analysis of cholera patterns in Nigeria: findings from a cross-sectional study. BMC Infect Dis 21, 202 (2021). https://doi.org/10.1186/s12879-021-05894-2↩︎

  16. Getis, A., & Ord, K. (1992). “The Analysis of Spatial Association by Use of Distance Statistics”. Geographical Analysis, 24, 189–206↩︎